Noisy labels (i.e., product items placed into the wrong category) in large-scale e-commerce product data are a critical problem for product categorization tasks: they are unavoidable, too numerous to remove entirely, and they degrade prediction performance. Training product title classification models that are robust to noisy labels in the training data is therefore important for making product categorization applications more practical. In this paper, we study the effects of instance-dependent noise on the performance of product title classification by comparing our data-denoising algorithm against different noise-resistant training algorithms, which are designed to prevent the classifier from over-fitting to the noise. We develop a simple yet effective deep neural network for product title classification to serve as the base classifier. In addition to recent methods for simulating instance-dependent noise, we propose a novel noise-simulation algorithm based on product title similarity. Our experiments cover multiple datasets, various noise-simulation methods, and different training solutions. The results reveal the limitations of the classification task when the noise rate is non-negligible and the data distribution is highly skewed.
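To make the idea of similarity-based instance-dependent noise concrete, here is a minimal sketch: a fraction of items have their label flipped to the category of the most similar differently-labeled title. The function names, the Jaccard token-overlap similarity, and the flip rule are illustrative assumptions, not the paper's actual algorithm.

```python
import random


def jaccard(a, b):
    """Token-overlap similarity between two titles (illustrative choice)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)


def simulate_similarity_noise(titles, labels, noise_rate, similarity=jaccard, seed=0):
    """Hypothetical sketch of instance-dependent noise: flip `noise_rate`
    of the labels to the label of the most similar item from another class."""
    rng = rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(noise_rate * len(titles))
    for i in rng.sample(range(len(titles)), n_flip):
        # choose the label of the most similar title with a different label
        best_j = max(
            (j for j in range(len(titles)) if labels[j] != labels[i]),
            key=lambda j: similarity(titles[i], titles[j]),
        )
        noisy[i] = labels[best_j]
    return noisy
```

Because the corrupted label is borrowed from a genuinely similar item, the noise depends on the instance itself, which is harder for a classifier to filter out than uniform random flips.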
We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactions. While each user session can generate multiple training samples, many features' values do not change across these samples. We demonstrate how RecD exploits this property, end-to-end, across a deployed training pipeline. RecD optimizes data generation pipelines to decrease dataset storage and preprocessing resource demands and to maximize duplication within a training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors (IKJTs), to deduplicate feature values in each batch. We show how DLRM model architectures can leverage IKJTs to drastically increase training throughput. RecD improves the training and preprocessing throughput and storage efficiency by up to 2.49x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
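The core deduplication idea can be sketched in a few lines: store each unique feature value in a batch once, plus inverse indices to reconstruct the full batch, so the expensive embedding lookup runs only once per unique value. This is a hypothetical illustration of the principle behind IKJTs using NumPy, not Meta's implementation.

```python
import numpy as np


def dedup_batch(values):
    """Deduplicate a batch of sparse-feature IDs: unique IDs plus
    inverse indices that map each batch position to its unique ID."""
    unique, inverse = np.unique(values, return_inverse=True)
    return unique, inverse


def lookup_dedup(embedding_table, values):
    """Embedding lookup that computes once per unique ID, then scatters
    the results back to every position in the original batch."""
    unique, inverse = dedup_batch(values)
    embs = embedding_table[unique]  # one row fetch per unique ID
    return embs[inverse]            # expand back to full batch order
```

When many samples in a batch come from the same user session, the unique set is far smaller than the batch, so lookup and downstream compute shrink proportionally.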